Biological Pattern Discovery with R Machine Learning Approaches (Zheng Rong Yang)

$$\hat{\sigma}^2 = \operatorname{var}(x) = \frac{1}{N-1}\sum_{n=1}^{N}\left(x_n - \hat{\mu}\right)^2 \tag{2.3}$$

Based on these two Gaussian distributional statistics, an estimated density function is defined as below,

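$$f(x) = \mathcal{G}(x \mid \hat{\mu}, \hat{\sigma}^2) = \frac{1}{\sqrt{2\pi \times \hat{\sigma}^2}}\, e^{-\frac{(x-\hat{\mu})^2}{2\hat{\sigma}^2}} \tag{2.4}$$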

For instance, if a vector x = (1.3817, 0.1948, −0.1481, −3.2131, 0.0733, 1.6337, −0.7869, 0.4848, −0.3497) was expected to follow a Gaussian distribution, but the two distributional statistics ($\mu$ and $\sigma^2$) were unknown, the above equations can be used to estimate these two statistics. After calculation, the estimated population mean $\hat{\mu}$ was −0.13 and the estimated population variance $\hat{\sigma}^2$ was 1.79 for the data set x. The estimated Gaussian distribution for this data set is thus shown below,

$$f(x) = \mathcal{G}(x \mid -0.13, 1.79) = \frac{1}{\sqrt{2\pi \times 1.79}}\, e^{-\frac{(x+0.13)^2}{2 \times 1.79}}$$
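The estimation of the two statistics can be sketched in R as follows. The sample is simulated purely for illustration (it is not the vector x above), and the names x_sim, mu_hat and sigma2_hat are assumptions of this sketch rather than anything defined in the text.

```r
# Hypothetical sample, simulated for illustration only
set.seed(1)
x_sim <- rnorm(50, mean = 0, sd = 1.3)
N <- length(x_sim)

# Estimated population mean: the arithmetic average of the sample
mu_hat <- sum(x_sim) / N

# Estimated population variance with the N - 1 denominator, as in Eq. (2.3)
sigma2_hat <- sum((x_sim - mu_hat)^2) / (N - 1)

# R's built-in mean() and var() should agree with the manual calculations
print(all.equal(mu_hat, mean(x_sim)))
print(all.equal(sigma2_hat, var(x_sim)))
```

The built-in var() function also divides by N − 1, which is why the manual estimate and var() should agree.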

The parametric approach for estimating a Gaussian distribution is implemented by the dnorm function in R. The function needs three inputs. The first input is a vector of values at which the density is evaluated. The second input specifies the mean of the data, whose default value is zero. The third input is the standard deviation of the data, whose default value is one.

dnorm(x,mean=0,sd=1)
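Because the third input of dnorm is a standard deviation rather than a variance, an estimated variance has to be square-rooted before it is passed in. The sketch below illustrates this with the estimates quoted earlier (−0.13 and 1.79) and compares the result with the standard Gaussian density written out by hand. The grid x_grid and the other variable names are assumptions made for this illustration.

```r
# Estimates quoted in the text for the data set x
mu_hat <- -0.13
sigma2_hat <- 1.79

# Hypothetical grid of points at which the density is evaluated
x_grid <- seq(-4, 4, by = 0.5)

# dnorm expects the standard deviation, so take the square root of the variance
f_dnorm <- dnorm(x_grid, mean = mu_hat, sd = sqrt(sigma2_hat))

# The standard Gaussian density written out explicitly for comparison
f_manual <- exp(-(x_grid - mu_hat)^2 / (2 * sigma2_hat)) /
  sqrt(2 * pi * sigma2_hat)

# The two sets of density values should agree up to floating-point error
print(all.equal(f_dnorm, f_manual))
```

Passing the variance itself as sd would not raise an error, but it would silently produce a density that is too wide or too narrow, so the square root is worth checking.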

A Gaussian distribution for a data set can be generated based on the estimated population mean and the estimated population variance, i.e., $\hat{\mu}$ and $\hat{\sigma}^2$.

Here, the breast cancer diagnosis data set [Wolberg, et al., 1994;

et al., 1995] was used for this demonstration. The data set was

composed of 30 mammographic features for breast cancer diagnosis. The feature named radius (tumour radius) was used for this demonstration. The benign tumours were first separated from the malignant tumours.